65 research outputs found

    Predicting speculation: a simple disambiguation approach to hedge detection in biomedical literature

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>This paper presents a novel approach to the problem of <it>hedge detection</it>, which involves identifying so-called hedge cues for labeling sentences as certain or uncertain. This is the classification problem for Task 1 of the CoNLL-2010 Shared Task, which focuses on hedging in the biomedical domain. We here propose to view hedge detection as a simple disambiguation problem, restricted to words that have previously been observed as hedge cues. As the feature space for the classifier is still very large, we also perform experiments with dimensionality reduction using the method of <it>random indexing</it>.</p> <p>Results</p> <p>The SVM-based classifiers developed in this paper achieves the best published results so far for sentence-level uncertainty prediction on the CoNLL-2010 Shared Task test data. We also show that the technique of random indexing can be successfully applied for reducing the dimensionality of the original feature space by several orders of magnitude, without sacrificing classifier performance.</p> <p>Conclusions</p> <p>This paper introduces a simplified approach to detecting speculation or uncertainty in text, focusing on the biomedical domain. Evaluated at the sentence-level, our SVM-based classifiers achieve the best published results so far. We also show that the feature space can be aggressively compressed using random indexing while still maintaining comparable classifier performance.</p

    Temporal dynamics of semantic relations in word embeddings: an application to predicting armed conflict participants

    Full text link
    This paper deals with using word embedding models to trace the temporal dynamics of semantic relations between pairs of words. The set-up is similar to the well-known analogies task, but expanded with a time dimension. To this end, we apply incremental updating of the models with new training texts, including incremental vocabulary expansion, coupled with learned transformation matrices that let us map between members of the relation. The proposed approach is evaluated on the task of predicting insurgent armed groups based on geographical locations. The gold standard data for the time span 1994--2010 is extracted from the UCDP Armed Conflicts dataset. The results show that the method is feasible and outperforms the baselines, but also that important work still remains to be done.Comment: to appear in EMNLP 2017 proceeding

    Random Indexing Re-Hashed

    Get PDF
    Proceedings of the 18th Nordic Conference of Computational Linguistics NODALIDA 2011. Editors: Bolette Sandford Pedersen, Gunta Nešpore and Inguna Skadiņa. NEALT Proceedings Series, Vol. 11 (2011), 224-229. © 2011 The editors and contributors. Published by Northern European Association for Language Technology (NEALT) http://omilia.uio.no/nealt . Electronically published at Tartu University Library (Estonia) http://hdl.handle.net/10062/16955

    Redefining part-of-speech classes with distributional semantic models

    Full text link
    This paper studies how word embeddings trained on the British National Corpus interact with part of speech boundaries. Our work targets the Universal PoS tag set, which is currently actively being used for annotation of a range of languages. We experiment with training classifiers for predicting PoS tags for words based on their embeddings. The results show that the information about PoS affiliation contained in the distributional vectors allows us to discover groups of words with distributional patterns that differ from other words of the same part of speech. This data often reveals hidden inconsistencies of the annotation process or guidelines. At the same time, it supports the notion of `soft' or `graded' part of speech affiliations. Finally, we show that information about PoS is distributed among dozens of vector components, not limited to only one or two features

    Transfer and Multi-Task Learning for Noun-Noun Compound Interpretation

    Full text link
    In this paper, we empirically evaluate the utility of transfer and multi-task learning on a challenging semantic classification task: semantic interpretation of noun--noun compounds. Through a comprehensive series of experiments and in-depth error analysis, we show that transfer learning via parameter initialization and multi-task learning via parameter sharing can help a neural classification model generalize over a highly skewed distribution of relations. Further, we demonstrate how dual annotation with two distinct sets of relations over the same set of compounds can be exploited to improve the overall accuracy of a neural classifier and its F1 scores on the less frequent, but more difficult relations.Comment: EMNLP 2018: Conference on Empirical Methods in Natural Language Processing (EMNLP

    Evaluating Semantic Vectors for Norwegian

    Get PDF
    In this article, we present two benchmark data sets for evaluating models of semantic word similarity for Norwegian. While such resources are available for English, they did not exist for Norwegian prior to this work. Furthermore, we produce large-coverage semantic vectors trained on the Norwegian Newspaper Corpus using several popular word embedding frameworks. Finally, we demonstrate the usefulness of the created resources for evaluating performance of different word embedding models on the tasks of analogical reasoning and synonym detection. The benchmark data sets and word embeddings are all made freely available

    Improving cross-domain dependency parsing with dependency-derived clusters

    Get PDF
    This paper describes a semi-supervised approach to improving statistical dependency parsing using dependency-based word clusters. After applying a baseline parser to unlabeled text, clusters are induced using K-means with word features based on the dependency structures. The parser is then re-trained using information about the clusters, yielding improved parsing accuracy on a range of different data sets, including WSJ and the English Web Treebank. We report improved results using both in-domain and out-of-domain data, and also include a comparison with using n-gram-based Brown clustering
    corecore